Read the Ramen Ratings dataset by The Ramen Rater (https://github.com/rfordatascience/tidytuesday/tree/master/data/2019/2019-06-04).
ramen<- read.csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-06-04/ramen_ratings.csv")
head(ramen)
str(ramen)
## 'data.frame': 3180 obs. of 6 variables:
## $ review_number: int 3180 3179 3178 3177 3176 3175 3174 3173 3172 3171 ...
## $ brand : chr "Yum Yum" "Nagatanien" "Acecook" "Maison de Coree" ...
## $ variety : chr "Tem Tem Tom Yum Moo Deng" "tom Yum Kung Rice Vermicelli" "Kelp Broth Shio Ramen" "Ramen Gout Coco Poulet" ...
## $ style : chr "Cup" "Pack" "Cup" "Cup" ...
## $ country : chr "Thailand" "Japan" "Japan" "France" ...
## $ stars : num 3.75 2 2.5 3.75 5 3.5 3.75 5 3.5 4.25 ...
library(dplyr)
ramen$country%>%table()
## .
## Australia Bangladesh Brazil Cambodia Canada
## 25 11 12 5 48
## China Colombia Dubai Estonia Fiji
## 207 6 3 2 4
## Finland France Germany Ghana Holland
## 3 4 28 2 4
## Hong Kong Hungary India Indonesia Italy
## 155 9 41 150 3
## Japan Malaysia Mexico Myanmar Nepal
## 532 182 28 14 14
## Netherlands New Zealand Nigeria Pakistan Philippines
## 16 1 2 9 49
## Phlippines Poland Russia Sarawak Singapore
## 1 6 3 5 134
## South Korea Sweden Taiwan Thailand UK
## 357 3 330 205 69
## Ukraine United States USA Vietnam
## 3 382 1 112
sort(table(ramen$country), decreasing=T)
##
## Japan United States South Korea Taiwan China
## 532 382 357 330 207
## Thailand Malaysia Hong Kong Indonesia Singapore
## 205 182 155 150 134
## Vietnam UK Philippines Canada India
## 112 69 49 48 41
## Germany Mexico Australia Netherlands Myanmar
## 28 28 25 16 14
## Nepal Brazil Bangladesh Hungary Pakistan
## 14 12 11 9 9
## Colombia Poland Cambodia Sarawak Fiji
## 6 6 5 5 4
## France Holland Dubai Finland Italy
## 4 4 3 3 3
## Russia Sweden Ukraine Estonia Ghana
## 3 3 3 2 2
## Nigeria New Zealand Phlippines USA
## 2 1 1 1
Wait: Holland and Netherlands? United States and USA? Let’s keep these
duplicates in mind and check for NAs first.
ramen%>%is.na()%>%table
## .
## FALSE TRUE
## 19065 15
Print out the rows with NA values.
ramen[!complete.cases(ramen),]
Netherlands and Holland will drop out later anyway, since together they have fewer than 200 products (16 + 4 = 20).
The USA row, however, is a complete row and should count toward United States. Let’s relabel it as United States.
ramen$country[ramen$country=="USA"]<- "United States"
ramen$country %>% table() %>% sort(decreasing = T)
## .
## Japan United States South Korea Taiwan China
## 532 383 357 330 207
## Thailand Malaysia Hong Kong Indonesia Singapore
## 205 182 155 150 134
## Vietnam UK Philippines Canada India
## 112 69 49 48 41
## Germany Mexico Australia Netherlands Myanmar
## 28 28 25 16 14
## Nepal Brazil Bangladesh Hungary Pakistan
## 14 12 11 9 9
## Colombia Poland Cambodia Sarawak Fiji
## 6 6 5 5 4
## France Holland Dubai Finland Italy
## 4 4 3 3 3
## Russia Sweden Ukraine Estonia Ghana
## 3 3 3 2 2
## Nigeria New Zealand Phlippines
## 2 1 1
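If the smaller countries were to be kept, the other spelling variants in the table (Holland vs. Netherlands, Phlippines vs. Philippines) could be merged in the same way. The following is only a sketch on a toy vector, not the real data. The lookup vector `fixes` is a hypothetical helper, and which label counts as canonical is an assumption.

```r
# Toy stand-in for ramen$country (not the real dataset)
country <- c("USA", "Holland", "Japan", "Phlippines", "Netherlands")

# Hypothetical lookup: names are the variants, values are the labels to keep
fixes <- c("USA"        = "United States",
           "Holland"    = "Netherlands",
           "Phlippines" = "Philippines")

# Replace only the entries that appear in the lookup
hit <- country %in% names(fixes)
country[hit] <- unname(fixes[country[hit]])
country
```

A lookup vector keeps all relabeling rules in one place, so adding another variant is a one-line change.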
According to the data description, ratings come in steps of 0.25. Let’s
remove the rows whose stars value is not a multiple of 0.25.
ramen %>% filter(stars%%0.25 != 0)
ramen %>% filter(stars%%0.25 != 0) %>% select(country) %>% table()
## country
## China Japan Malaysia South Korea Taiwan
## 1 1 1 7 5
## Thailand United States Vietnam
## 6 1 1
ramen <- ramen %>% filter(stars%%0.25 == 0)
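Two details of this filter are easy to miss. Multiples of 0.25 are exactly representable as doubles, so the modulo test is safe here. Also, `filter()` drops rows where the condition is NA, so rows with a missing stars value disappear at this step too. A base R sketch on toy data (the data frame `toy` is made up for illustration) shows both:

```r
# Toy ratings (not the real dataset): one off-grid value and one NA
toy <- data.frame(id = 1:4, stars = c(3.75, 3.1, NA, 5))

# TRUE for multiples of 0.25, FALSE for off-grid values, NA for missing ones
on_grid <- toy$stars %% 0.25 == 0

# Keep only rows that are known to be multiples of 0.25
kept <- toy[!is.na(on_grid) & on_grid, ]
kept$id
```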
Now subset the data, keeping only the countries with more than 200 products.
ramen200<- ramen[complete.cases(ramen),] %>%
group_by(country) %>% mutate(productCount = n()) %>%
filter(productCount>200)
head(ramen200, 10)
ramen200%>%is.na()%>%table
## .
## FALSE
## 12467
ramen200$country %>% table() %>% sort(decreasing = T)
## .
## Japan United States South Korea Taiwan China
## 528 374 348 325 206
Check the summary of the data first.
summary(ramen200)
## review_number brand variety style
## Min. : 1 Length:1781 Length:1781 Length:1781
## 1st Qu.: 747 Class :character Class :character Class :character
## Median :1594 Mode :character Mode :character Mode :character
## Mean :1612
## 3rd Qu.:2535
## Max. :3179
## country stars productCount
## Length:1781 Min. :0.000 Min. :206.0
## Class :character 1st Qu.:3.500 1st Qu.:325.0
## Mode :character Median :3.750 Median :374.0
## Mean :3.743 Mean :386.2
## 3rd Qu.:4.500 3rd Qu.:528.0
## Max. :5.000 Max. :528.0
ramen200_summ <- ramen200 %>%
  group_by(country, stars) %>%
  summarise(ratingCount = n()) %>%
  mutate(ratingRatio = ratingCount / sum(ratingCount))
ramen200_summ
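The same per-country shares can be sketched in base R with `table()` and `prop.table()`. The example below uses a made-up data frame `toy`, not the ramen data. One difference from the dplyr version is that the table also lists combinations with a count of zero.

```r
# Toy data (not the real dataset): two countries, a few ratings each
toy <- data.frame(
  country = c("A", "A", "A", "B", "B"),
  stars   = c(5, 5, 3.5, 4, 4)
)

# Cross-tabulate counts: one row per country, one column per rating
counts <- table(toy$country, toy$stars)

# Divide each row by its total, so every country's shares sum to 1
ratio <- prop.table(counts, margin = 1)
ratio
```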
Or, try tapply to show simple statistics for each country.
tapply(values, index, function): apply function to values for each group in index.
tapply(ramen200$stars, ramen200$country, summary)
## $China
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 3.250 3.750 3.473 4.188 5.000
##
## $Japan
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 3.500 4.000 3.913 4.750 5.000
##
## $`South Korea`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 3.500 3.750 3.843 4.250 5.000
##
## $Taiwan
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 3.250 4.000 3.801 5.000 5.000
##
## $`United States`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 3.000 3.750 3.509 4.188 5.000
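To see what tapply does on something small enough to check by hand, here is a minimal sketch with made-up vectors (`values` and `index` are toy data, not taken from the ramen dataset):

```r
# Toy data: five values split into two groups
values <- c(4, 5, 3, 1, 2)
index  <- c("x", "x", "x", "y", "y")

# Apply mean() to the values of each group in index
means <- tapply(values, index, mean)
means
```

Group "x" averages 4, 5, and 3, and group "y" averages 1 and 2. The result has one entry per group, named after the group.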
tapply(ramen200$stars, ramen200$country, table)
## $China
##
## 0 0.25 0.5 1 1.25 1.5 1.75 2 2.25 2.5 2.75 3 3.25 3.5 3.75 4
## 6 3 2 3 1 4 4 3 2 3 8 11 11 26 41 26
## 4.25 4.5 4.75 5
## 18 14 2 18
##
## $Japan
##
## 0 0.25 0.5 0.75 1 1.25 1.5 1.75 2 2.25 2.5 2.75 3 3.25 3.5 3.75
## 4 3 2 1 6 1 5 1 7 2 12 15 21 20 69 63
## 4 4.25 4.5 4.75 5
## 62 39 58 32 105
##
## $`South Korea`
##
## 0 0.5 1 1.75 2 2.25 2.5 2.75 3 3.25 3.5 3.75 4 4.25 4.5 4.75
## 2 1 2 1 9 3 7 7 11 19 55 65 59 24 20 10
## 5
## 53
##
## $Taiwan
##
## 0 0.25 0.5 1 1.25 1.5 1.75 2 2.25 2.5 2.75 3 3.25 3.5 3.75 4
## 4 1 1 6 4 6 4 7 3 7 8 9 26 28 45 34
## 4.25 4.5 4.75 5
## 23 16 7 86
##
## $`United States`
##
## 0 0.25 0.5 1 1.25 1.5 1.75 2 2.25 2.5 2.75 3 3.25 3.5 3.75 4
## 8 3 1 2 1 9 6 15 5 10 16 21 30 53 55 45
## 4.25 4.5 4.75 5
## 23 19 7 45
Well…?
Role of data visualization and its impact.
To get your audience to understand your data effectively, you should select proper chart types.
plot(x, y, main="title", xlab="x-axis label", ylab="y-axis label"):
scatter plot.
x and y should be numeric.
If only one vector is given, it is drawn on the y-axis against its index, so you can see the distribution of individual values.
With x and y together, you can see the relation between them.
plot(ramen200$stars)
plot(ramen200$stars, main="Distribution of Ratings: all 5 countries", ylab="Rating")
barplot(height, main="title", xlab="x-axis label", ylab="y-axis label"):
bar chart.
height is a vector or matrix of values; each value is the height (y value) of one bar.
A formula interface is also available:
barplot(formula, data, main="title", xlab="x-axis label", ylab="y-axis label")
barplot(ramen200$stars)
barplot(table(ramen200$stars), main="Distribution of Ratings: all 5 countries", ylab="Frequency", xlab="Rating")
barplot(ratingCount ~ stars + country, data = ramen200_summ)
hist(x, main="title", xlab="x-axis label", ylab="y-axis label"):
histogram.
x is a vector of numeric values.
The frequency of x within each bin is the y value.
par(mfrow = c(1, 3))
hist(ramen200$stars, main = "Histogram")
barplot(ramen200$stars, main = "Barplot")
barplot(table(ramen200$stars), main = "Barplot: table")
par(mfrow = c(1, 2))
hist(ramen200$stars, main = "Histogram", breaks = seq(0,5,by=0.25))
barplot(table(ramen200$stars), main = "Barplot: table")
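One caveat when the breaks coincide with the rating values: by default hist() uses right-closed bins and includes the lowest value in the first bin, so ratings of 0 and 0.25 are counted together. The sketch below uses a made-up vector `x` and, as one possible alternative rather than the author's method, shifts the breaks by half a step so that each rating gets its own bin.

```r
# Toy ratings (not the real dataset)
x <- c(0, 0.25, 0.25, 0.5, 5)

# Breaks on the rating values: 0 and 0.25 share the first bin [0, 0.25]
edge <- hist(x, breaks = seq(0, 5, by = 0.25), plot = FALSE)
edge$counts[1:2]

# Breaks shifted by half a step: every rating sits in the middle of its own bin
centered <- hist(x, breaks = seq(-0.125, 5.125, by = 0.25), plot = FALSE)
centered$counts[1:3]
```

With centered breaks the bin counts match table(x), which is why the barplot of the table and the histogram can otherwise disagree at the low end.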
pie(x, labels = names(x), main="title"): pie chart.
The values of x are displayed as slices, in order. If the values have names,
each name labels the corresponding slice; otherwise the slices are
numbered from 1.
pie(ramen200$stars)
pie(ramen200$stars[1:10])
pie(table(ramen200$stars[1:10]))
pie(table(ramen200$stars), main="Distribution of Ratings: all 5 countries")
library(ggplot2)
ggplot(ramen200[order(ramen200$stars),], aes(x = country, y = stars, fill = stars)) +
geom_bar(stat = "identity") +
scale_fill_gradient2(low = "green", high = "red", mid = "yellow", midpoint =2.5) +
labs(
title = "Distribution of Ratings by Country",
x = "Country",
y = "Rating")
#install.packages("ggbeeswarm")
library(ggbeeswarm)
ggplot(ramen200, aes(x = country, y = stars, color = country)) +
geom_beeswarm(cex = 0.25 , alpha=0.8, show.legend = FALSE) +
labs(
title = "Distribution of Ratings by Country",
x = "Country",
y = "Rating"
)